UC Berkeley Library and D-Lab
Fall 2017
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
Provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
gapminder <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors = TRUE)
head(gapminder)## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
We can use base R functions to calculate summary statistics across groups of observations:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])## [1] 7902.15
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition.
install.packages('dplyr')Here we’re going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.
select()filter()group_by()summarize()mutate()arrange()Now let’s load the package:
library(dplyr)Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the variables we select.
year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
The pipe operator is one of dplyr’s greatest strengths. It allows you combine several functions in a chain.
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)is equivalent to:
year_country_gdp <- select(gapminder, year, country, gdpPercap)Now let’s say we’re only interested in African countries. We can combine select and filter to select only the observations where continent is Africa.
year_country_gdp_euro <- gapminder %>%
filter(continent == "Africa") %>%
select(year,country,gdpPercap)Logicals
| < | less than | != | not equal to |
|---|---|---|---|
| > | greater than | %in% | group membership |
| == | equal to | is.na | is NA |
| <= | less than or equal to | !is.na | is not NA |
| >= | greater than or equal to | &, | , !, xor, any, all |
Create a new object called africa_asia that is filtered by “Africa” and “Asia”*. Select all variables except for gdpPercap. Use the pipe operator to chain the functions together.
*hint: use the & boolean operator
A common task you’ll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculated the mean GDP per capita for each continent? In base R, you would have to run the mean() function for each subset of data:
mean(gapminder$gdpPercap[gapminder$continent == "Africa"])## [1] 2193.755
mean(gapminder$gdpPercap[gapminder$continent == "Americas"])## [1] 7136.11
mean(gapminder$gdpPercap[gapminder$continent == "Asia"])## [1] 7902.15
mean(gapminder$gdpPercap[gapminder$continent == "Europe"])## [1] 14469.48
mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])## [1] 18621.61
Assigning to data frames:
gapminder$mean.continent.GDP <- NA
gapminder$mean.continent.GDP[gapminder$continent == "Africa"] <- mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
gapminder$mean.continent.GDP[gapminder$continent == "Americas"] <- mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
gapminder$mean.continent.GDP[gapminder$continent == "Asia"] <- mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
gapminder$mean.continent.GDP[gapminder$continent == "Europe"] <- mean(gapminder$gdpPercap[gapminder$continent == "Europe"])
gapminder$mean.continent.GDP[gapminder$continent == "Oceania"] <- mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents)## # A tibble: 5 x 2
## continent mean_gdpPercap
## <fctr> <dbl>
## 1 Africa 2193.755
## 2 Americas 7136.110
## 3 Asia 7902.150
## 4 Europe 14469.476
## 5 Oceania 18621.609
group_by() multiple variables
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents_byyear)## # A tibble: 6 x 3
## # Groups: continent [1]
## continent year mean_gdpPercap
## <fctr> <int> <dbl>
## 1 Africa 1952 1252.572
## 2 Africa 1957 1385.236
## 3 Africa 1962 1598.079
## 4 Africa 1967 2050.364
## 5 Africa 1972 2339.616
## 6 Africa 1977 2585.939
group_by() multiple variables and defining multiple variable with summarize()
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
head(gdp_pop_bycontinents_byyear)## # A tibble: 6 x 6
## # Groups: continent [1]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1252.572 982.9521 4570010 6317450
## 2 Africa 1957 1385.236 1134.5089 5093033 7076042
## 3 Africa 1962 1598.079 1461.8392 5702247 7957545
## 4 Africa 1967 2050.364 2847.7176 6447875 8985505
## 5 Africa 1972 2339.616 3286.8539 7305376 10130833
## 6 Africa 1977 2585.939 4142.3987 8328097 11585184
lifeExp for each continentWhat if we wanted to add these values to our original data frame instead of creating a new object?
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
head(gapminder_with_extra_vars)## # A tibble: 6 x 11
## # Groups: continent, year [6]
## country year pop continent lifeExp gdpPercap
## <fctr> <int> <dbl> <fctr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
## # ... with 5 more variables: mean.continent.GDP <dbl>,
## # mean_gdpPercap <dbl>, sd_gdpPercap <dbl>, mean_pop <dbl>, sd_pop <dbl>
We can use also use mutate() to create new variables prior to (or even after) summarizing information.
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion = gdpPercap*pop/10^9) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop),
mean_gdp_billion = mean(gdp_billion),
sd_gdp_billion = sd(gdp_billion))
head(gdp_pop_bycontinents_byyear)## # A tibble: 6 x 8
## # Groups: continent [1]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1252.572 982.9521 4570010 6317450
## 2 Africa 1957 1385.236 1134.5089 5093033 7076042
## 3 Africa 1962 1598.079 1461.8392 5702247 7957545
## 4 Africa 1967 2050.364 2847.7176 6447875 8985505
## 5 Africa 1972 2339.616 3286.8539 7305376 10130833
## 6 Africa 1977 2585.939 4142.3987 8328097 11585184
## # ... with 2 more variables: mean_gdp_billion <dbl>, sd_gdp_billion <dbl>
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop)) %>%
arrange(desc(year), continent)
head(gapminder_with_extra_vars)## # A tibble: 6 x 11
## # Groups: continent, year [1]
## country year pop continent lifeExp gdpPercap
## <fctr> <int> <dbl> <fctr> <dbl> <dbl>
## 1 Algeria 2007 33333216 Africa 72.301 6223.3675
## 2 Angola 2007 12420476 Africa 42.731 4797.2313
## 3 Benin 2007 8078314 Africa 56.728 1441.2849
## 4 Botswana 2007 1639131 Africa 50.728 12569.8518
## 5 Burkina Faso 2007 14326203 Africa 52.295 1217.0330
## 6 Burundi 2007 8390505 Africa 49.580 430.0707
## # ... with 5 more variables: mean.continent.GDP <dbl>,
## # mean_gdpPercap <dbl>, sd_gdpPercap <dbl>, mean_pop <dbl>, sd_pop <dbl>
Use dplyr to add a column to the gapminder dataset that contains the total population of the continent of each observation in a given year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.
Use dplyr to: (a) add a column called gdpPercap_diff that contains the difference between the observation’s gdpPercap and the mean gdpPercap of the continent in that year, (b) arrange the dataframe by the column you just created, in descending order (so that the relatively richest country/years are listed first)
Even before we conduct analysis or calculations, we need to put our data into the correct format. The goal here is to rearrange a messy dataset into one that is tidy
The two most important properties of tidy data are:
“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy
“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham
wide## name time1 time2 time3
## 1 Wilbur 67 56 70
## 2 Petunia 80 90 67
## 3 Gregory 64 50 101
long## name time heartrate
## 1 Wilbur 1 67
## 2 Petunia 1 80
## 3 Gregory 1 64
## 4 Wilbur 2 56
## 5 Petunia 2 90
## 6 Gregory 2 50
## 7 Wilbur 3 70
## 8 Petunia 3 67
## 9 Gregory 3 10
Which one of these do you think is the tidy format?
This data frame is somewhere in between the purely ‘long’ and ‘wide’ formats. We have 3 “ID variables” (continent, country, year) and 3 “Observation variables” (pop, lifeExp, gdpPercap).
head(gapminder)## country year pop continent lifeExp gdpPercap mean.continent.GDP
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453 7902.15
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7902.15
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 7902.15
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971 7902.15
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811 7902.15
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134 7902.15
library(tidyr)gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
head(gap_wide)## continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
## 1 Africa Algeria 2449.0082 3013.9760 2550.8169
## 2 Africa Angola 3520.6103 3827.9405 4269.2767
## 3 Africa Benin 1062.7522 959.6011 949.4991
## 4 Africa Botswana 851.2411 918.2325 983.6540
## 5 Africa Burkina Faso 543.2552 617.1835 722.5120
## 6 Africa Burundi 339.2965 379.5646 355.2032
## gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982
## 1 3246.9918 4182.6638 4910.4168 5745.1602
## 2 5522.7764 5473.2880 3008.6474 2756.9537
## 3 1035.8314 1085.7969 1029.1613 1277.8976
## 4 1214.7093 2263.6111 3214.8578 4551.1421
## 5 794.8266 854.7360 743.3870 807.1986
## 6 412.9775 464.0995 556.1033 559.6032
## gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002
## 1 5681.3585 5023.2166 4797.2951 5288.0404
## 2 2430.2083 2627.8457 2277.1409 2773.2873
## 3 1225.8560 1191.2077 1232.9753 1372.8779
## 4 6205.8839 7954.1116 8647.1423 11003.6051
## 5 912.0631 931.7528 946.2950 1037.6452
## 6 621.8188 631.6999 463.1151 446.4035
## gdpPercap_2007 lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967
## 1 6223.3675 43.077 45.685 48.303 51.407
## 2 4797.2313 30.015 31.999 34.000 35.985
## 3 1441.2849 38.223 40.358 42.618 44.885
## 4 12569.8518 47.622 49.618 51.520 53.298
## 5 1217.0330 31.975 34.906 37.814 40.697
## 6 430.0707 39.031 40.533 42.045 43.548
## lifeExp_1972 lifeExp_1977 lifeExp_1982 lifeExp_1987 lifeExp_1992
## 1 54.518 58.014 61.368 65.799 67.744
## 2 37.928 39.483 39.942 39.906 40.647
## 3 47.014 49.190 50.904 52.337 53.919
## 4 56.024 59.319 61.484 63.622 62.745
## 5 43.591 46.137 48.122 49.557 50.260
## 6 44.057 45.910 47.471 48.211 44.736
## lifeExp_1997 lifeExp_2002 lifeExp_2007 pop_1952 pop_1957 pop_1962
## 1 69.152 70.994 72.301 9279525 10270856 11000948
## 2 40.963 41.003 42.731 4232095 4561361 4826015
## 3 54.777 54.406 56.728 1738315 1925173 2151895
## 4 52.556 46.634 50.728 442308 474639 512764
## 5 50.324 50.650 52.295 4469979 4713416 4919632
## 6 45.326 47.360 49.580 2445618 2667518 2961915
## pop_1967 pop_1972 pop_1977 pop_1982 pop_1987 pop_1992 pop_1997 pop_2002
## 1 12760499 14760787 17152804 20033753 23254956 26298373 29072015 31287142
## 2 5247469 5894858 6162675 7016384 7874230 8735988 9875024 10866106
## 3 2427334 2761407 3168267 3641603 4243788 4981671 6066080 7026113
## 4 553541 619351 781472 970347 1151184 1342614 1536536 1630347
## 5 5127935 5433886 5889574 6634596 7586551 8878303 10352843 12251209
## 6 3330989 3529983 3834415 4580410 5126023 5809236 6121610 7021078
## pop_2007
## 1 33333216
## 2 12420476
## 3 8078314
## 4 1639131
## 5 14326203
## 6 8390505
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, 3:38)
head(gap_long)## continent country obstype_year obs_values
## 1 Africa Algeria gdpPercap_1952 2449.0082
## 2 Africa Angola gdpPercap_1952 3520.6103
## 3 Africa Benin gdpPercap_1952 1062.7522
## 4 Africa Botswana gdpPercap_1952 851.2411
## 5 Africa Burkina Faso gdpPercap_1952 543.2552
## 6 Africa Burundi gdpPercap_1952 339.2965
We put 3 arguments into the gather() function:
the name of the new column for the new ID variable (obstype_year)
the name for the new amalgamated observation variable (obs_value)
the indices of the old observation variables (3:38, signalling columns 3 through 38) that we want to gather into one variable. Notice that we don’t want to melt down columns 1 and 2, as these are considered “ID” variables.
We can also select observation variables using:
x:z to select all variables between x and z-y to exclude ystarts_with(x, ignore.case = TRUE): all names that starts with xends_with(x, ignore.case = TRUE): all names that ends with xcontains(x, ignore.case = TRUE): all names that contain xThe same thing with the starts_with function
# with the starts_with() function
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, starts_with('pop'),
starts_with('lifeExp'), starts_with('gdpPercap'))
head(gap_long)## continent country obstype_year obs_values
## 1 Africa Algeria pop_1952 9279525
## 2 Africa Angola pop_1952 4232095
## 3 Africa Benin pop_1952 1738315
## 4 Africa Botswana pop_1952 442308
## 5 Africa Burkina Faso pop_1952 4469979
## 6 Africa Burundi pop_1952 2445618
the - operator:
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, -continent, -country)
head(gap_long)## continent country obstype_year obs_values
## 1 Africa Algeria gdpPercap_1952 2449.0082
## 2 Africa Angola gdpPercap_1952 3520.6103
## 3 Africa Benin gdpPercap_1952 1062.7522
## 4 Africa Botswana gdpPercap_1952 851.2411
## 5 Africa Burkina Faso gdpPercap_1952 543.2552
## 6 Africa Burundi gdpPercap_1952 339.2965
in our long dataset, obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year.
We can use the separate() function to split the character strings into multiple variables:
gap_long_sep <- gap_long %>%
separate(obstype_year, into = c('obs_type','year'), sep = "_") %>%
mutate(year = as.integer(year))
head(gap_long_sep)## continent country obs_type year obs_values
## 1 Africa Algeria gdpPercap 1952 2449.0082
## 2 Africa Angola gdpPercap 1952 3520.6103
## 3 Africa Benin gdpPercap 1952 1062.7522
## 4 Africa Botswana gdpPercap 1952 851.2411
## 5 Africa Burkina Faso gdpPercap 1952 543.2552
## 6 Africa Burundi gdpPercap 1952 339.2965
The opposite of gather() is spread(). It spreads our observation variables back out to make a wider table. We can use this function to spread our gap_long() to the original “medium” format.
gap_medium <- gap_long_sep %>%
spread(obs_type, obs_values)
head(gap_medium)## continent country year gdpPercap lifeExp pop
## 1 Africa Algeria 1952 2449.008 43.077 9279525
## 2 Africa Algeria 1957 3013.976 45.685 10270856
## 3 Africa Algeria 1962 2550.817 48.303 11000948
## 4 Africa Algeria 1967 3246.992 51.407 12760499
## 5 Africa Algeria 1972 4182.664 54.518 14760787
## 6 Africa Algeria 1977 4910.417 58.014 17152804
All we need is some quick fixes to make this dataset identical to the original gapminder dataset:
gapminder <- read.csv("data/gapminder-FiveYearData.csv")
head(gapminder)## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
# rearrange columns
gap_medium <- gap_medium[,names(gapminder)]
head(gap_medium)## country year pop continent lifeExp gdpPercap
## 1 Algeria 1952 9279525 Africa 43.077 2449.008
## 2 Algeria 1957 10270856 Africa 45.685 3013.976
## 3 Algeria 1962 11000948 Africa 48.303 2550.817
## 4 Algeria 1967 12760499 Africa 51.407 3246.992
## 5 Algeria 1972 14760787 Africa 54.518 4182.664
## 6 Algeria 1977 17152804 Africa 58.014 4910.417
# arrange by country, continent, and year
gap_medium <- gap_medium %>%
arrange(country,continent,year)
head(gap_medium)## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
dplyr and tidyr have many more functions to help you wrangle and manipulate your data. See the Data Wrangling Cheat Sheet for more.
Subset the results from challenge 3 to select only the country, year, and gdpPercap_diff columns. Use tidyr put it in wide format so that countries are rows and years are columns.
Now turn the dataframe above back into the long format with three columns: country, year, and gdpPercap_diff.